Semi-automatic Construction of a Tree-annotated Corpus Using an Iterative Learning Statistical Language Model

Authors

  • Kiyoaki Shirai
  • Hozumi Tanaka
  • Takenobu Tokunaga
Abstract

In this paper, we propose a method to construct a tree-annotated corpus when a statistical parsing system exists but no tree-annotated corpus is available as training data. The basic idea of our method is to sequentially annotate plain text inputs with syntactic trees using a parser with a statistical language model, and to iteratively retrain the statistical language model on the annotated trees obtained. The major characteristics of our method are as follows: (1) in the first step of the iterative learning process, we manually construct a tree-annotated corpus on which to initialize the statistical language model, and (2) at each step of the parse tree annotation process, we use both syntactic statistics obtained from the iterative learning process and lexical statistics pre-derived from existing language resources to choose the most probable
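The iterative procedure outlined in the abstract can be illustrated with a minimal sketch. The abstract is truncated here and does not specify the model, so everything below is an assumption made for illustration: trees are represented as tuples of rule strings, the "statistical language model" is a simple add-one smoothed rule-frequency count, the parser's output is given as a list of candidate trees per sentence, and `lexical_score` is a hypothetical stand-in for the pre-derived lexical statistics. The names `train`, `syntactic_score`, `annotate`, and `build_corpus` are invented for this sketch and are not taken from the paper.

```python
from collections import Counter
from math import log


def train(trees):
    """Count rule occurrences over all trees (each tree is a tuple of rule strings)."""
    counts = Counter()
    for tree in trees:
        counts.update(tree)
    return counts


def syntactic_score(tree, counts):
    """Sum of add-one smoothed log relative frequencies of the tree's rules."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen rules
    return sum(log((counts[rule] + 1) / (total + vocab)) for rule in tree)


def annotate(candidates, counts, lexical_score):
    """Choose the candidate tree with the highest combined score."""
    return max(
        candidates,
        key=lambda tree: syntactic_score(tree, counts) + lexical_score(tree),
    )


def build_corpus(seed_trees, batches, lexical_score):
    """Grow a corpus from manually built seed trees.

    `batches` is a list of batches; each batch is a list of sentences;
    each sentence is a list of candidate trees produced by some parser.
    """
    corpus = list(seed_trees)
    for batch in batches:
        counts = train(corpus)  # retrain on everything annotated so far
        corpus.extend(annotate(cands, counts, lexical_score) for cands in batch)
    return corpus


# Toy data: one hand-built seed tree and one sentence with two candidate parses.
seed = [("S->NP VP", "NP->N", "VP->V NP", "NP->N")]
cand_a = ("S->NP VP", "NP->N", "VP->V NP", "NP->N")
cand_b = ("S->VP NP", "VP->V", "NP->N N")
corpus = build_corpus(seed, [[[cand_a, cand_b]]], lambda tree: 0.0)
```

In this toy run the candidate whose rules were already seen in the seed corpus is selected and appended, and later batches would be scored with counts retrained on the enlarged corpus. A real system would replace the rule counter with an actual statistical parser and the constant `lexical_score` with statistics drawn from lexical resources.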


Similar resources

A New Method for Named Entity Extraction in Classical Arabic

In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers because of its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...


Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...


Building a Tree-Bank of Modern Hebrew Text

This paper describes the process of building the first tree-bank for Modern Hebrew texts. A major concern in this process is the need for reducing the cost of manual annotation by the use of automatic means. To this end, the joint utility of an automatic morphological analyzer, a probabilistic parser and a small manually annotated tree-bank was explored. An initial tree-bank that consists of 50...


An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols

We describe an effort to annotate a corpus of natural language instructions consisting of 662 wet lab protocols to facilitate automatic or semi-automatic conversion of protocols into a machine-readable format and benefit biological research. Experimental results demonstrate the utility of our corpus for developing machine learning approaches to shallow semantic parsing of instructional texts. W...


A CART approach for Duration Modeling of Greek Phonemes

This paper describes the construction and evaluation of a segmental duration prediction model for the Greek language using the CART (Classification and Regression Tree) machine learning approach. A ToBI annotated prosodic speech corpus was utilized for the construction of training and testing sets. Our phoneme category was composed of 34 phonemes distributed in 32.072 instances (in 5....



Publication date: 2000